State Farm Distracted Drivers

Prev Exercises: Udacity:DeepLearning:TensorFlow:notMNIST

Baseline

notMNIST: This notebook uses the notMNIST dataset to be used with python experiments. This dataset is designed to look like the classic MNIST dataset, while looking a little more like real data: it's a harder task, and the data is a lot less 'clean' than MNIST.

In [1]:
import sys
print sys.version

from joblib import Parallel, delayed  
import multiprocessing

nCores = multiprocessing.cpu_count() - 2 # Allow other apps to run
print 'nCores: %d' % (nCores)
2.7.11 (default, Jan 28 2016, 14:07:46) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
nCores: 14
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, Image

from datetime import datetime, time
import numpy as np
import os
import pandas as pd
from scipy import ndimage
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle
from skimage import color as sk_color
from skimage import io as sk_io
from skimage import transform as sk_transform
import tarfile

%run img_utils.py

Analytics Specs

This Project

The specs should be in img_glbSpecs_SFDD

In [3]:
print type('string')
<type 'str'>
In [4]:
%run img_glbSpec_SFDD_ImgSz_64.py
imported img_glbSpec_SFDD_Img_Sz_64.py
In [5]:
#print 'glbDataFile: %s' % (glbDataFile)

print 'glbImg: %s' % (glbImg)

print 'glbRspClass: %s' % (glbRspClass)
print 'glbRspClassN: %d' % (glbRspClassN)

print 'glbPickleFile: %s' % (glbPickleFile)
glbImg: {'color': False, 'crop': {'x': (80, 560)}, 'shape': (480, 640, 3), 'pxlDepth': 255.0, 'center_scale': True, 'size': 64}
glbRspClass: ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']
glbRspClassN: 10
glbPickleFile: {'models': 'data/img_M_SFDD_ImgSz_64.pickle', 'data': 'data/img_D_SFDD_ImgSz_64.pickle'}

notMNIST

In [6]:
# glbDataURL = 'http://yaroslavvb.com/upload/notMNIST/'
# glbImg['size'] = 32

Import Data

First, we'll download the dataset to our local machine.

In [7]:
print type('string')
<type 'str'>
In [8]:
def maybe_download(url, filename, expected_bytes = None):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists('data/' + filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat('data/' + filename)
  verified = False
  if (expected_bytes == None):
    if (statinfo.st_size > 0):
        verified = True
  else:      
    if (statinfo.st_size == expected_bytes):
        verified = True
    
  if verified:
    print('Found and verified', 'data/' + filename)
  else:
    raise Exception(
      'Failed to verify' + filename + '. Can you get to it with a browser?')
  return 'data/' + filename

dataFNm = maybe_download(glbDataFile['url'], glbDataFile['filename'])
('Found and verified', 'data/imgs.zip')
In [9]:
# url = 'http://yaroslavvb.com/upload/notMNIST/'

# def maybe_download(url, filename, expected_bytes):
#   """Download a file if not present, and make sure it's the right size."""
#   if not os.path.exists(filename):
#     filename, _ = urlretrieve(url + filename, filename)
#   statinfo = os.stat(filename)
#   if statinfo.st_size == expected_bytes:
#     print('Found and verified', filename)
#   else:
#     raise Exception(
#       'Failed to verify' + filename + '. Can you get to it with a browser?')
#   return filename

# train_filename = maybe_download('data/notMNIST_large.tar.gz', 247336696)
# test_filename = maybe_download('data/notMNIST_small.tar.gz', 8458043)

Extract the dataset from the compressed downloaded file(s).

In [10]:
def extract(filename, num_classes):
  print("Figure out automatically if data needs to be extracted")
  return
    
  tar = tarfile.open(filename)
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
  print('Extracting data for %s. This may take a while. Please wait.' % root)
  sys.stdout.flush()
  tar.extractall()
  tar.close()
  # My edits: data_folders needs to be modified for the correct path
  data_folders = [
    os.path.join(root, d) for d in sorted(os.listdir(root)) if d != '.DS_Store']
  if len(data_folders) != num_classes:
    raise Exception(
      'Expected %d folders, one per class. Found %d instead.' % (
        num_classes, len(data_folders)))
  print(data_folders)
  return data_folders

if (glbDataFile['extract']):
    train_folders = extract(os.getcwd() + train_filename, glbRspClassN)
    test_folders  = extract(os.getcwd() + test_filename , glbRspClassN)

notMNINST:
Extraction give you a set of directories, labelled A through J. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k and the obsNewSet 19000 labelled examples. Given these sizes, it should be possible to train models quickly on any machine.

Sample images

Let's take a peek at some of the data to make sure it looks sensible.

In [11]:
print type('string')
<type 'str'>
In [12]:
driverDf = pd.read_csv('data/driver_imgs_list.csv')
print driverDf.describe()
# print driverDf.shape
print driverDf.head()
print driverDf.tail()
       subject classname            img
count    22424     22424          22424
unique      26        10          22424
top       p021        c0  img_97080.jpg
freq      1237      2489              1
  subject classname            img
0    p002        c0  img_44733.jpg
1    p002        c0  img_72999.jpg
2    p002        c0  img_25094.jpg
3    p002        c0  img_69092.jpg
4    p002        c0  img_92629.jpg
      subject classname            img
22419    p081        c9  img_56936.jpg
22420    p081        c9  img_46218.jpg
22421    p081        c9  img_25946.jpg
22422    p081        c9  img_67850.jpg
22423    p081        c9   img_9684.jpg
In [13]:
trnFoldersPth = os.getcwd() + '/data/' + glbDataFile['trnFoldersPth']
newFoldersPth = os.getcwd() + '/data/' + glbDataFile['newFoldersPth']
# print(trnFoldersPth)
# print(newFoldersPth)

Display sample train images

Collect data corrections into glbDataScrub

In [14]:
print type('string')
<type 'str'>
In [56]:
def myreadImage(filePthNm):
    img = sk_io.imread(filePthNm)
    try:
        assert img.shape == glbImg['shape'], 'img.shape: %s' % \
            (img.shape)
        assert np.min(img) >= 0, 'img.min: %.4f' % \
            (np.min(img))
        assert np.max(img) <= glbImg['pxlDepth'], 'img.min: %.4f' % \
            (np.max(img))
    except AssertionError, e:
        print 'filePthNm: %s' % (filePthNm)
        print e
        raise
        
    return(img)

plt.imshow(myreadImage(trnFoldersPth + '/c0/img_15117.jpg'))
# plt.imshow(myreadImage(trnFoldersPth + '/c8/img_67168.jpg'))
# plt.imshow(myreadImage(trnFoldersPth + '/c9/img_84986.jpg'))
# plt.imshow(myreadImage(trnFoldersPth + '/c9/img_95888.jpg'))
Out[56]:
<matplotlib.image.AxesImage at 0x1163dbf50>
In [16]:
print type('string')
<type 'str'>
In [17]:
smpClsImg = {}; smpN = 3
for cls in glbRspClass:
    clsImg = {}
#     print 'Class: %s' % (cls)
    clsPth = trnFoldersPth + '/' + cls
    onlyfiles = [f for f in os.listdir(clsPth) 
                    if os.path.isfile(os.path.join(clsPth, f))]
    for ix in np.random.randint(0, len(onlyfiles), size = smpN):
#         print '  %s:' % (onlyfiles[ix])
#         img = sk_io.imread(clsPth + '/' + onlyfiles[ix])
#         assert img.shape == (480, 640, 3), 'img.shape: %s' % (img.shape)
#         assert np.min(img) == 0, 'img.min: %.4f' % (np.min(img))
#         assert np.max(img) == glbImg['pxlDepth'], 'img.min: %.4f' % (np.max(img))        
        clsImg[onlyfiles[ix]] = myreadImage(clsPth + '/' + onlyfiles[ix])
#         jpgfile = Image(clsPth + '/' + onlyfiles[ix], format = 'jpg', 
#                         width = glbImg['size'] * 4, height = glbImg['size'] * 4)
#         display(jpgfile)

    smpClsImg[cls] = clsImg
    
# print smpClsImg    
        
figs, axes = plt.subplots(len(glbRspClass), smpN, 
                          figsize=(5 * smpN, 4 * len(glbRspClass)))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for i, cls in enumerate(smpClsImg.keys()):
    for j, imgFileName in enumerate(smpClsImg[cls].keys()):
        axes[i, j].imshow(smpClsImg[cls][imgFileName])
        axes[i, j].set_title(cls + ':' + imgFileName)
In [18]:
print type('string')
<type 'str'>
In [19]:
smpSbtImg = {}; smpN = 3
for sbt in driverDf['subject'].values[
        np.random.randint(0, len(driverDf['subject'].values), 
                          size = smpN)]:
    sbtImg = {}
#     print '  subject: %s' % (sbt)
    driverSbtDf = driverDf[driverDf['subject'] == sbt]
#     print driverSbtDf.shape
    
    clsPth = trnFoldersPth + '/' + cls
    onlyfiles = [f for f in os.listdir(clsPth) 
                    if os.path.isfile(os.path.join(clsPth, f))]
    for cls in driverSbtDf['classname'].values[
            np.random.randint(0, len(driverSbtDf['classname'].values), 
                              size = smpN)]:
#         print '    class: %s' % (cls)
#         print "    driverSbtDf[driverSbtDf['classname'] == cls]['img'].shape = %s" % \
#             (driverSbtDf[driverSbtDf['classname'] == cls]['img'].shape)

        imgFnm = driverSbtDf[driverSbtDf['classname'] == cls]['img'].iloc[0]    
        dctKey = cls + ':' + imgFnm
        imgFnm = trnFoldersPth + '/' + cls + '/' + imgFnm                    
#         img = sk_io.imread(imgFnm)
#         assert img.shape == (480, 640, 3), 'img.shape: %s' % (img.shape)
        sbtImg[dctKey] = myreadImage(imgFnm)
#         jpgfile = Image(clsPth + '/' + onlyfiles[ix], format = 'jpg', 
#                         width = glbImg['size'] * 4, height = glbImg['size'] * 4)
#         display(jpgfile)

    smpSbtImg[sbt] = sbtImg
    
# print smpClsImg    
        
nRow = smpN; nCol = smpN    
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 6 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for i, sbt in enumerate(smpSbtImg.keys()):
    for j, imgDesc in enumerate(smpSbtImg[sbt].keys()):
        axes[i, j].imshow(smpSbtImg[sbt][imgDesc])
        axes[i, j].set_title(sbt + ':' + imgDesc)
In [20]:
print type('string')
<type 'str'>
In [21]:
def mytransformImage(raw, retVals = 'final'):
    assert retVals in ['final', 'each'], \
        'unsupported retVals option: %s' % (retVals)
    
    prcImgDct = {'raw': raw, 'fnl': raw.astype(float)}
    fnlShape = rawShape = raw.shape
    
    # 'crop'
    if ('crop' in glbImg.keys()):
        xmin = 0; xmax = rawShape[1]
        ymin = 0; ymax = rawShape[0]        
        if ('x' in glbImg['crop'].keys()):
            xmin, xmax = glbImg['crop']['x']
        if ('y' in glbImg['crop'].keys()):
            ymin, ymax = glbImg['crop']['y']
        
        if retVals == 'each':
            prcImgDct['crp'] = sk_transform.resize(raw[ymin : ymax, 
                                                       xmin : xmax], 
                                                   rawShape)
        prcImgDct['fnl'] = sk_transform.resize(
                            prcImgDct['fnl'][ymin : ymax, xmin : xmax], 
                                               rawShape)
    # 'size'        
#     if not glbImg['color']:        
#         fnlShape = (glbImg['size'], glbImg['size'], 1)
#     else:    
#         fnlShape = (glbImg['size'], glbImg['size'], rawShape[2])
    fnlShape = (glbImg['size'], glbImg['size'], rawShape[2])        
    if (rawShape != fnlShape):
        if retVals == 'each':
            prcImgDct['sze'] = sk_transform.resize(raw, fnlShape)
        prcImgDct['fnl'] = sk_transform.resize(prcImgDct['fnl'], fnlShape)
           
    # 'color'        
    if not glbImg['color']:
        if retVals == 'each':        
            prcImgDct['gry'] = sk_color.rgb2gray(raw)
        prcImgDct['fnl'] = sk_color.rgb2gray(prcImgDct['fnl'])
        
    # 'center_scale'            
    if glbImg['center_scale']:
        if retVals == 'each':        
            prcImgDct['c_s'] = (raw.astype(float) - glbImg['pxlDepth'] / 2.0) / \
                                glbImg['pxlDepth']
        prcImgDct['fnl'] = (prcImgDct['fnl'] - glbImg['pxlDepth'] / 2.0) / \
                                glbImg['pxlDepth']
        
    if retVals == 'final':
        return prcImgDct['fnl']
    else:
        return prcImgDct
        
sbt = smpSbtImg.keys()[0]
tstRawImg = smpSbtImg[sbt][smpSbtImg[sbt].keys()[0]]
tstPrcImg = mytransformImage(tstRawImg, retVals = 'final')
nRow = 1; nCol = 2
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 4 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for j, typImg in enumerate(range(2)):
    if (j == 0):
        axes[j].imshow(tstRawImg)
        axes[j].set_title('raw')
    if (j == 1):
        if not glbImg['color']:
            plt.imshow(tstPrcImg, cmap = 'gray')
        else:    
            plt.imshow(tstPrcImg)        
        axes[j].set_title('fnl')            
plt.show()        
    
tstPrcImg = mytransformImage(tstRawImg, retVals = 'each')
nRow = 1; nCol = 2
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 4 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for j, typImg in enumerate(range(2)):
    if (j == 0):
        axes[j].imshow(tstRawImg)
        axes[j].set_title('raw')        
    if (j == 1):
        if not glbImg['color']:
            plt.imshow(tstPrcImg['fnl'], cmap = 'gray')
        else:    
            plt.imshow(tstPrcImg['fnl'])        
        axes[j].set_title('fnl')            
nRow = 1; nCol = len(tstPrcImg.values()) - 2
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 4 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for j, typImg in enumerate(list(set(tstPrcImg.keys()) - set(['raw', 'fnl']))):
    if (typImg == 'gry'):
        axes[j].imshow(tstPrcImg[typImg], cmap = 'gray')
    else:    
        axes[j].imshow(tstPrcImg[typImg])
    axes[j].set_title(typImg)
In [22]:
print type('string')
<type 'str'>
In [23]:
smpSbt0Img = smpSbtImg[smpSbtImg.keys()[0]]
smpPrcImg = {}
for key, value in smpSbt0Img.items():
    smpPrcImg[smpSbtImg.keys()[0] + ':' + key] = value
    
print 'smpPrcImg.keys(): %s' % (smpPrcImg.keys())
for key, raw in smpPrcImg.items():
    prcImgDct = mytransformImage(raw, retVals = 'each')        
    smpPrcImg[key] = prcImgDct

# Ideally 'fnl' should be the last col in the plot    
nRow = len(smpPrcImg.keys()); nCol = len(smpPrcImg.values()[0].keys())
# print 'nRow: %d; nCol: %d' % (nRow, nCol)
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 4 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for i, sbtClsImgFnm in enumerate(smpPrcImg.keys()):
    for j, typImg in enumerate(smpPrcImg[sbtClsImgFnm].keys()):
        if ((typImg == 'gry') or 
            ((typImg == 'fnl') and ('gry' in smpPrcImg[sbtClsImgFnm].keys()))):
            if (nRow > 1):
                axes[i, j].imshow(smpPrcImg[sbtClsImgFnm][typImg], cmap = 'gray')
            else:
                axes[j].imshow(smpPrcImg[sbtClsImgFnm][typImg], cmap = 'gray')
        else:    
            if (nRow > 1):            
                axes[i, j].imshow(smpPrcImg[sbtClsImgFnm][typImg])
            else:    
                axes[j].imshow(smpPrcImg[sbtClsImgFnm][typImg])
        if (nRow > 1):        
            axes[i, j].set_title(sbtClsImgFnm + ':' + typImg)
        else:    
            axes[j].set_title(sbtClsImgFnm + ':' + typImg)        
smpPrcImg.keys(): ['p016:c9:img_57609.jpg', 'p016:c8:img_100735.jpg']

Display sample test images

In [24]:
print type('string')
<type 'str'>
In [25]:
onlyfiles = [f for f in os.listdir(newFoldersPth) 
                    if os.path.isfile(os.path.join(newFoldersPth, f))]
# print onlyfiles[:5]

smpNewImg = {}; smpN = 3
# print smpN ** 2
# print np.random.randint(0, len(onlyfiles), size = smpN ** 2)
for imgFnm in [onlyfiles[ix] 
               for ix in np.random.randint(0, len(onlyfiles), size = smpN ** 2)]:

#     print '  imgFnm: %s' % (imgFnm)

#     img = sk_io.imread(newFoldersPth + '/' + imgFnm)
#     assert img.shape == (480, 640, 3), 'img.shape: %s' % (img.shape)
    smpNewImg[imgFnm] = myreadImage(newFoldersPth + '/' + imgFnm)
        
nRow = smpN; nCol = smpN    
figs, axes = plt.subplots(nRow, nCol, 
                          figsize=(6 * nCol, 5 * nRow))
[(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) for ax in axes.flatten()]
for i, imgFnm in enumerate(smpNewImg.keys()):
    axes[i / nCol, i % nCol].imshow(smpNewImg[imgFnm])
    axes[i / nCol, i % nCol].set_title(imgFnm)

notMNINST:

Each exemplar should be an image of a character A through J rendered in a different font.

In [26]:
# Display sample train images
# train_folders_path = '/Users/bbalaji-2012/Documents/Work/Courses/Udacity/DeepLearning/code/tensorflow/examples/udacity/data/notMNIST_large/'
# glbImg['size'] = 28
# display(Image(train_folders_path + 'A/a2F6b28udHRm.png', \
#               width = glbImg['size'] * 4, height = glbImg['size'] * 4))
# display(Image(train_folders_path + 'B/bnVuaS50dGY=.png', \
#               width = glbImg['size'] * 4, height = glbImg['size'] * 4))
# display(Image(train_folders_path + 'C/cmlzay50dGY=.png', \
#               width = glbImg['size'] * 4, height = glbImg['size'] * 4))

Populate database

Now let's load the data in a more manageable format.

We'll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean (notMNIST only: and standard deviation ~0.5) to make training easier down the road. The labels will be stored into a separate array (notMNINST only: of integers 0 through 9.)

A few images might not be readable, we'll just skip them.

In [27]:
trnFolders = os.getcwd() + '/data/' + glbDataFile['trnFoldersPth']
trnFolders = [trnFolders + '/' + cls for cls in glbRspClass]
print 'trnFolders: %s' % (trnFolders)
newFolders = [os.getcwd() + '/data/' + glbDataFile['newFoldersPth']]
print 'newFolders: %s' % (newFolders)
trnFolders: ['/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c0', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c1', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c2', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c3', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c4', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c5', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c6', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c7', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c8', '/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c9']
newFolders: ['/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/test']
In [28]:
# data_folders_path = '/Users/bbalaji-2012/Documents/Work/Courses/Udacity/DeepLearning/code/tensorflow/examples/udacity/data/'
# train_folders = [data_folders_path + 'notMNIST_large/' + d \
#                  for d in sorted(os.listdir(data_folders_path + 'notMNIST_large/')) \
#                     if d != '.DS_Store']
# print train_folders
# test_folders  = [data_folders_path + 'notMNIST_small/' + d \
#                  for d in sorted(os.listdir(data_folders_path + 'notMNIST_small/')) \
#                     if d != '.DS_Store']
# print test_folders
In [29]:
#from scipy import misc as sp_misc
In [30]:
def load(idClass, folderPth, nImgMax, maxCheck = True, verbose = False):
  
    assert isinstance(idClass, str), \
        'expecting type(idClass) as str, not %s' % (type(idClass))  

    assert isinstance(folderPth, str), \
        'expecting type(folderPth) as str, not %s' % (type(folderPth))  
    
    assert nImgMax > 0, \
        'nImgMax: %d has to be > 0' % (nImgMax)  
    
    assert isinstance(maxCheck, bool), \
        'expecting type(maxCheck) as bool, not %s' % (type(maxCheck))  
    
    startTm = datetime.now()  
    
    ids = ['' for ix in xrange(nImgMax)]  
    dataset = np.ndarray(
        shape=(nImgMax, glbImg['size'], glbImg['size']), dtype=np.float32)
    labels = np.ndarray(shape=(nImgMax), dtype=np.int32)
#   label_index = 0
    try:
        labelsVal = glbRspClass.index(idClass)
    except ValueError, e:
        print 'unknown class: %s; defaulting label to -1' % (idClass)
        labelsVal = -1
    except Exception, e:
        print(e)
        raise
        
    labels[:] = labelsVal  
    image_index = 0

#   if isinstance(data_folders, str):
#     data_folders = [data_folders]

#   for fldrIx, folder in enumerate(data_folders):
    print 'Class: %s; Folder: %s' % (idClass, folderPth)
#     print(os.listdir(folder)[:6])
    for image in os.listdir(folderPth):
#       print(image)
#       print((image_index >= (nImgMax / len(data_folders) * (fldrIx + 1))))
      if maxCheck and (image_index >= nImgMax):
        raise Exception('More images than expected: %d >= %d' % (
          image_index, nImgMax))
#       elif (image_index >= (nImgMax / len(data_folders) * (fldrIx + 1))):
      elif image_index >= nImgMax: break
        
      image_file = os.path.join(folderPth, image)
      try:
        rawImg = myreadImage(image_file)
      except IOError as e:
        print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')  
        next
        
      prcImg = mytransformImage(rawImg, retVals = 'final')  
#       try:
#         rsz_image_data = sp_misc.imresize(ndimage.imread(image_file, flatten = not glbImgColor), 
#                                       (glbImg['size'], glbImg['size']))
#         image_data = (rsz_image_data.astype(float) -
#                       glbImgPixelDepth / 2) / glbImgPixelDepth
#         if image_data.shape != (glbImg['size'], glbImg['size']):
#           raise Exception('Unexpected image shape: %s' % str(image_data.shape))
        
      ids[image_index] = image
      dataset[image_index, :, :] = prcImg
#       labels[image_index] = label_index
        
      if mydspVerboseTrigger(image_index): 
#             print '  image_index: %d; %s:' % (image_index, image)
            print '  image_index: %5d (%5d secs)' % \
                (image_index, (datetime.now() - startTm).seconds)
            if verbose:
                nRow = 1; nCol = 2
                figs, axes = plt.subplots(nRow, nCol, 
                                              figsize=(6 * nCol, 4 * nRow))
                [(ax.set_xticks([]), ax.set_yticks([]), ax.axis('off')) 
                     for ax in axes.flatten()]
                for j, typImg in enumerate(range(0, nCol)):
                    if (j == 0):
                        axes[j].imshow(rawImg)
    #                     axes[j].set_title(glbRspClass[label_index] + ':' + image + ':raw')                    
                        axes[j].set_title(idClass + ':' + image + ':raw')                                        
                    else:    
                        if not glbImg['color']:
                            axes[j].imshow(prcImg, cmap = 'gray')
                        else:    
                            axes[j].imshow(prcImg)
                        axes[j].set_title('fnl')
    #             display(sp_misc.toimage(rsz_image_data))
                plt.show()
            
      image_index += 1            
#     label_index += 1
    
    num_images = image_index
    ids = ids[0:num_images]  
    dataset = dataset[0:num_images, :, :]
    labels = labels[0:num_images]
#   if num_images < min_num_images:
#     raise Exception('Many fewer images than expected: %d < %d' % (
#         num_images, min_num_images))
    print('  Identifiers:', len(ids))
    print('  Full dataset tensor:', dataset.shape)
    print('  Mean:', np.mean(dataset))
    print('  Standard deviation:', np.std(dataset))
    print('  Labels:', labels.shape)
    print('  Label Knts:'); print(pd.Series(labels).value_counts())    
    
    return {'Cls': idClass, 'Dbs': {'Idn': ids, 'Ftr': dataset, 'Rsp': labels}}

smpC5ObsTrnDct = load('c5', trnFolders[5], 25, maxCheck = False, verbose = True)
smpObsNewDct = load('new', newFolders[0], 25, maxCheck = False, verbose = False)
# smqObsTrnIdn, smqObsTrnFtr, smqObsTrnRsp = load(trnFolders, 250, 
#                                                 max_check = False)
# print smpObsTrnRsp.value_counts()
# print smpObsTrnIdn[10:15]
# glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp = load(trnFolders, 22435)
Class: c5; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c5
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    1 secs)
  image_index:     6 (    1 secs)
  image_index:     8 (    2 secs)
  image_index:    20 (    3 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.11314436)
('  Standard deviation:', 0.32475406)
('  Labels:', (25,))
  Label Knts:
5    25
dtype: int64
unknown class: new; defaulting label to -1
Class: new; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/test
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10439148)
('  Standard deviation:', 0.33399168)
('  Labels:', (25,))
  Label Knts:
-1    25
dtype: int64

Compare sequential vs. parallel loading results

In [31]:
thsBgnTm = datetime.now()
smqObsTrnLst = []
# for cls in glbRspClass[-2:]:
for cls in glbRspClass:    
    smqClsObsTrnDct = load(cls, trnFolders[glbRspClass.index(cls)], 25, 
                           maxCheck = False, verbose = False)
    smqObsTrnLst.append(smqClsObsTrnDct)

print 'len(smqObsTrnLst): %d' % (len(smqObsTrnLst))    
thsDurDff = (datetime.now() - thsBgnTm).seconds  
print 'Trn Smp Sequential load duration: %0.2f seconds' % (thsDurDff) 
Class: c0; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c0
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10371291)
('  Standard deviation:', 0.3219693)
('  Labels:', (25,))
  Label Knts:
0    25
dtype: int64
Class: c1; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c1
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.11209381)
('  Standard deviation:', 0.32118186)
('  Labels:', (25,))
  Label Knts:
1    25
dtype: int64
Class: c2; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c2
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10655484)
('  Standard deviation:', 0.33100829)
('  Labels:', (25,))
  Label Knts:
2    25
dtype: int64
Class: c3; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c3
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.099427894)
('  Standard deviation:', 0.32730374)
('  Labels:', (25,))
  Label Knts:
3    25
dtype: int64
Class: c4; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c4
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10423807)
('  Standard deviation:', 0.31945148)
('  Labels:', (25,))
  Label Knts:
4    25
dtype: int64
Class: c5; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c5
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.11314436)
('  Standard deviation:', 0.32475406)
('  Labels:', (25,))
  Label Knts:
5    25
dtype: int64
Class: c6; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c6
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.099318922)
('  Standard deviation:', 0.33076665)
('  Labels:', (25,))
  Label Knts:
6    25
dtype: int64
Class: c7; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c7
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.087649062)
('  Standard deviation:', 0.32005942)
('  Labels:', (25,))
  Label Knts:
7    25
dtype: int64
Class: c8; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c8
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10032745)
('  Standard deviation:', 0.32751)
('  Labels:', (25,))
  Label Knts:
8    25
dtype: int64
Class: c9; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c9
  image_index:     0 (    0 secs)
  image_index:     2 (    0 secs)
  image_index:     4 (    0 secs)
  image_index:     6 (    0 secs)
  image_index:     8 (    0 secs)
  image_index:    20 (    1 secs)
('  Identifiers:', 25)
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.086796135)
('  Standard deviation:', 0.33149031)
('  Labels:', (25,))
  Label Knts:
9    25
dtype: int64
len(smqObsTrnLst): 10
Sequential load duration: 21.00 seconds
In [32]:
thsBgnTm = datetime.now()
smrObsTrnLst = Parallel(n_jobs = nCores, verbose = 1)(delayed(
        load)(cls, trnFolders[glbRspClass.index(cls)], 25, 
                maxCheck = False, verbose = False) for cls in glbRspClass)
print 'len(smrObsTrnLst): %d' % (len(smrObsTrnLst))    
thsDurDff = (datetime.now() - thsBgnTm).seconds  
print 'Trn Smp Parallel load duration: %0.2f seconds' % (thsDurDff) 
len(smrObsTrnLst): 10
Parallel load duration: 3.00 seconds
Class: c0; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c0
Class: c1; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c1
Class: c2; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c2
Class: c3; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c3
Class: c4; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c4
Class: c5; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c5
Class: c6; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c6
Class: c7; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c7
Class: c9; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c9
Class: c8; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c8
  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)









  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)









  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)  image_index:     4 (    0 secs)









  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)  image_index:     6 (    0 secs)









  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)









  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)  image_index:    20 (    2 secs)









('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)('  Identifiers:', 25)









('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Full dataset tensor:', (25, 64, 64))
('  Mean:', -0.10371291)
('  Mean:', -0.11209381)
('  Mean:', -0.10655484)
('  Mean:', -0.099427894)
('  Mean:', -0.10423807)
('  Mean:', -0.11314436)
('  Mean:', -0.099318922)
('  Mean:', -0.087649062)
('  Mean:', -0.086796135)
('  Mean:', -0.10032745)
('  Standard deviation:', 0.3219693)
('  Standard deviation:', 0.32118186)
('  Standard deviation:', 0.33100829)
('  Standard deviation:', 0.32730374)
('  Standard deviation:', 0.31945148)
('  Standard deviation:', 0.32475406)
('  Standard deviation:', 0.33076665)
('  Standard deviation:', 0.32005942)
('  Standard deviation:', 0.33149031)
('  Standard deviation:', 0.32751)
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
('  Labels:', (25,))
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
0    25
dtype: int641    25
dtype: int642    25
dtype: int643    25
dtype: int644    25
dtype: int645    25
dtype: int646    25
dtype: int647    25
dtype: int649    25
dtype: int648    25
dtype: int64









[Parallel(n_jobs=14)]: Done  10 out of  10 | elapsed:    3.3s finished
In [33]:
def myisEqualDct(d1, d2):
    d1_keys = set(d1.keys())
    d2_keys = set(d2.keys())
    intersect_keys = d1_keys.intersection(d2_keys)
    added = d1_keys - d2_keys
    removed = d2_keys - d1_keys
    
#     modified = {o : (d1[o], d2[o]) for o in intersect_keys if d1[o] != d2[o]}
    modified = {}
    for o in intersect_keys:
        if not (isinstance(d1[o], dict)):
            try:
                eql = d1[o] == d2[o]
    #             eql = (d1[o] == d2[o]) if not (isinstance(d1[o], dict)) else \
    #                   myisEqualDct(d1[o], d2[o])
            except ValueError, e:
                print e
                print 'key: %s: type:' % (o) 
                print type(d1[o]).mro()
                raise
        else: eql = myisEqualDct(d1[o], d2[o])
        if not isinstance(eql, bool):
#             print 'eql:'; print eql    
            eql = eql.all()
        if not eql: modified[o] = eql
        
    same = set(o for o in intersect_keys if not o in modified.keys())
    
    if (len(added) > 0):
        print '     added: %s' % (added)
    if (len(removed) > 0):
        print '   removed: %s' % (removed)
    if (len(modified) > 0):
        print '  modified: %s' % (modified)        
    if (len(same) != len(d1_keys)):
        print '      same: %s' % (same)        
    
    return ((len(added)    == 0) and 
            (len(removed)  == 0) and 
            (len(modified) == 0) and             
            (len(same)     == len(d2_keys)))

tstAB1Dct = {'a': 1, 'b': 1}; tstAB2Dct = {'a': 1, 'b': 2}
print myisEqualDct(tstAB1Dct, tstAB1Dct) 
print myisEqualDct(tstAB1Dct, tstAB2Dct) 
tstABC1Dct = {'ab': tstAB1Dct, 'c' : 1}; 
tstABC2Dct = {'ab': tstAB2Dct, 'c' : 3}; 
print myisEqualDct(tstABC1Dct, tstABC1Dct) 
print myisEqualDct(tstABC1Dct, tstABC2Dct) 
True
  modified: {'b': False}
      same: set(['a'])
False
True
  modified: {'b': False}
      same: set(['a'])
  modified: {'c': False, 'ab': False}
      same: set([])
False
In [34]:
print 'len(smqObsTrnLst): %d' % (len(smqObsTrnLst)) 
print 'len(smrObsTrnLst): %d' % (len(smrObsTrnLst)) 
for clsIx in range(len(glbRspClass)):
#     print 'clsIx: %s' % (clsIx)
#     print "type(smqObsTrnLst[clsIx]['Dbs']):" 
#     print    (str(type(smqObsTrnLst[clsIx]['Dbs']).mro()))    
#     print "type(smqObsTrnLst[clsIx]['Dbs']): %s" \
#         (str(type(smqObsTrnLst[clsIx]['Dbs']).mro()))    
#     print smqObsTrnLst[clsIx]
    assert myisEqualDct(smqObsTrnLst[clsIx], smrObsTrnLst[clsIx]), \
        'diff in class: %s' % glbRspClass[clsIx]        
len(smqObsTrnLst): 10
len(smrObsTrnLst): 10
In [36]:
print type('string')
<type 'str'>
In [49]:
# print 'numpy.ndarray' in type(smqObsTrnLst[9]['Dbs']['Rsp']).mro()
# print type(smqObsTrnLst[9]['Dbs']['Rsp'])
# print smqObsTrnLst[9]['Dbs']['Rsp'].shape
# print smqObsTrnLst[9]['Dbs']['Rsp']

# print type(smrObsTrnLst[9]['Dbs']['Rsp'])
# print smrObsTrnLst[9]['Dbs']['Rsp'].shape
# print smrObsTrnLst[9]['Dbs']['Rsp']
# print pd.Series(smrObsTrnRsp[9]['Dbs']['Rsp'])
# print pd.Series(smrObsTrnRsp[9]['Dbs']['Rsp']).value_counts()

tstArr = smrObsTrnLst[9]['Dbs']['Rsp']
print pd.Series(tstArr)
0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
14    9
15    9
16    9
17    9
18    9
19    9
20    9
21    9
22    9
23    9
24    9
dtype: int32
In [52]:
def mybuildDatabase(lclObsLst):
    # lclObsLst dictionary structure:
    #   {'Cls': idClass, 'Dbs': {'Idn': ids, 'Ftr': dataset, 'Rsp': labels}
    lclObsIdn = []
#     assert isinstance(lclObsIdn, list), 'lclObsIdn is not a list'
#     print 'type(lclObsIdn): %s' % type(lclObsIdn)
    lclObsFtr = lclObsRsp = None   
    for clsIx in range(len(lclObsLst)):
        lclObsIdn.extend(lclObsLst[clsIx]['Dbs']['Idn'])
        lclObsFtr = np.vstack((lclObsFtr, 
                               lclObsLst[clsIx]['Dbs']['Ftr'])) \
            if not (lclObsFtr == None) else lclObsLst[clsIx]['Dbs']['Ftr']
        lclObsRsp = np.hstack((lclObsRsp, 
                               lclObsLst[clsIx]['Dbs']['Rsp'])) \
            if not (lclObsRsp == None) else lclObsLst[clsIx]['Dbs']['Rsp']
#     print lclObsIdn    
    return lclObsIdn, lclObsFtr, lclObsRsp
    
smrObsTrnIdn, smrObsTrnFtr, smrObsTrnRsp = mybuildDatabase(smrObsTrnLst)
print('Identifiers:', len(smrObsTrnIdn))
print('Sample dataset tensor:', smrObsTrnFtr.shape)
print('Mean:', np.mean(smrObsTrnFtr))
print('Standard deviation:', np.std(smrObsTrnFtr))
print('Labels:', smrObsTrnRsp.shape)
# print(smrObsTrnRsp[25:30])
print('Label Knts:'); print(pd.Series(smrObsTrnRsp).value_counts())
('Identifiers:', 250)
('Full dataset tensor:', (250, 64, 64))
('Mean:', -0.10132633)
('Standard deviation:', 0.32568815)
('Labels:', (250,))
Label Knts:
9    25
8    25
7    25
6    25
5    25
4    25
3    25
2    25
1    25
0    25
dtype: int64
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:12: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:15: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
In [53]:
thsBgnTm = datetime.now()
glbObsTrnLst = Parallel(n_jobs = nCores, verbose = 1)(delayed(
        load)(cls, trnFolders[glbRspClass.index(cls)], 2500, 
                maxCheck = True, verbose = False) for cls in glbRspClass)
print 'len(glbObsTrnLst): %d' % (len(glbObsTrnLst))    
thsDurDff = (datetime.now() - thsBgnTm).seconds  
print 'Trn Parallel load duration: %0.2f seconds' % (thsDurDff) 
len(glbObsTrnLst): 10
Parallel load duration: 378.00 seconds
Class: c0; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c0
Class: c1; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c1
Class: c2; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c2
Class: c3; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c3
Class: c4; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c4
Class: c5; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c5
Class: c6; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c6
Class: c7; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c7
Class: c8; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c8
Class: c9; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/train/c9
  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)  image_index:     0 (    0 secs)









  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)  image_index:     2 (    0 secs)









  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)  image_index:     4 (    1 secs)









  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)  image_index:     6 (    1 secs)









  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)  image_index:     8 (    1 secs)









  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)  image_index:    20 (    3 secs)









  image_index:    40 (    5 secs)  image_index:    40 (    6 secs)  image_index:    40 (    6 secs)  image_index:    40 (    6 secs)  image_index:    40 (    5 secs)  image_index:    40 (    6 secs)  image_index:    40 (    5 secs)  image_index:    40 (    5 secs)  image_index:    40 (    5 secs)  image_index:    40 (    5 secs)









  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)  image_index:    60 (    8 secs)









  image_index:    80 (   10 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   11 secs)  image_index:    80 (   10 secs)









  image_index:   200 (   28 secs)  image_index:   200 (   28 secs)  image_index:   200 (   28 secs)  image_index:   200 (   28 secs)  image_index:   200 (   29 secs)  image_index:   200 (   31 secs)  image_index:   200 (   29 secs)  image_index:   200 (   28 secs)  image_index:   200 (   28 secs)  image_index:   200 (   29 secs)









  image_index:   400 (   57 secs)  image_index:   400 (   57 secs)  image_index:   400 (   57 secs)  image_index:   400 (   56 secs)  image_index:   400 (   58 secs)  image_index:   400 (   61 secs)  image_index:   400 (   57 secs)  image_index:   400 (   56 secs)  image_index:   400 (   56 secs)  image_index:   400 (   57 secs)









  image_index:   600 (   84 secs)  image_index:   600 (   84 secs)  image_index:   600 (   85 secs)  image_index:   600 (   84 secs)  image_index:   600 (   86 secs)  image_index:   600 (   91 secs)  image_index:   600 (   85 secs)  image_index:   600 (   84 secs)  image_index:   600 (   85 secs)  image_index:   600 (   85 secs)









  image_index:   800 (  117 secs)  image_index:   800 (  117 secs)  image_index:   800 (  117 secs)  image_index:   800 (  117 secs)  image_index:   800 (  120 secs)  image_index:   800 (  123 secs)
[Parallel(n_jobs=14)]: Done  10 out of  10 | elapsed:  6.3min finished
  image_index:   800 (  119 secs)  image_index:   800 (  117 secs)  image_index:   800 (  117 secs)  image_index:   800 (  118 secs)









  image_index:  2000 (  316 secs)  image_index:  2000 (  314 secs)  image_index:  2000 (  314 secs)  image_index:  2000 (  318 secs)  image_index:  2000 (  322 secs)  image_index:  2000 (  318 secs)  image_index:  2000 (  315 secs)  image_index:  2000 (  312 secs)('  Identifiers:', 1911)  image_index:  2000 (  316 secs)









('  Identifiers:', 2489)('  Identifiers:', 2267)('  Identifiers:', 2317)('  Identifiers:', 2346)('  Identifiers:', 2326)('  Identifiers:', 2312)('  Identifiers:', 2325)('  Identifiers:', 2002)('  Full dataset tensor:', (1911, 64, 64))
('  Identifiers:', 2129)







('  Mean:', -0.10261098)

('  Full dataset tensor:', (2489, 64, 64))
('  Full dataset tensor:', (2267, 64, 64))
('  Full dataset tensor:', (2317, 64, 64))
('  Full dataset tensor:', (2346, 64, 64))
('  Full dataset tensor:', (2326, 64, 64))
('  Full dataset tensor:', (2312, 64, 64))
('  Full dataset tensor:', (2325, 64, 64))
('  Full dataset tensor:', (2002, 64, 64))
('  Standard deviation:', 0.32375273)('  Full dataset tensor:', (2129, 64, 64))
('  Mean:', -0.10268358)
('  Mean:', -0.1035179)
('  Mean:', -0.10422858)
('  Mean:', -0.10468703)
('  Mean:', -0.098030552)
('  Mean:', -0.098016791)
('  Mean:', -0.096125968)
('  Mean:', -0.094121233)

('  Mean:', -0.088676713)
('  Standard deviation:', 0.32731777)('  Standard deviation:', 0.32373857)('  Standard deviation:', 0.32976153)('  Standard deviation:', 0.32440081)('  Standard deviation:', 0.32597044)('  Standard deviation:', 0.32542631)('  Standard deviation:', 0.32982066)('  Standard deviation:', 0.32239816)('  Labels:', (1911,))
('  Standard deviation:', 0.32720071)







  Label Knts:

('  Labels:', (2489,))
('  Labels:', (2267,))
('  Labels:', (2317,))
('  Labels:', (2346,))
('  Labels:', (2326,))
('  Labels:', (2312,))
('  Labels:', (2325,))
('  Labels:', (2002,))
8    1911
dtype: int64('  Labels:', (2129,))
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:
  Label Knts:

  Label Knts:
0    2489
dtype: int641    2267
dtype: int642    2317
dtype: int643    2346
dtype: int644    2326
dtype: int645    2312
dtype: int646    2325
dtype: int647    2002
dtype: int649    2129
dtype: int64








In [54]:
glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp = mybuildDatabase(glbObsTrnLst)
print('Identifiers:', len(glbObsTrnIdn))
print('Full dataset tensor:', glbObsTrnFtr.shape)
print('Mean:', np.mean(glbObsTrnFtr))
print('Standard deviation:', np.std(glbObsTrnFtr))
print('Labels:', glbObsTrnRsp.shape)
print('Label Knts:'); print(pd.Series(glbObsTrnRsp).value_counts())
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:12: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:15: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
('Identifiers:', 22424)
('Full dataset tensor:', (22424, 64, 64))
('Mean:', -0.099392936)
('Standard deviation:', 0.32611963)
('Labels:', (22424,))
Label Knts:
0    2489
3    2346
4    2326
6    2325
2    2317
5    2312
1    2267
9    2129
7    2002
8    1911
dtype: int64

Move display of train images here

Move test images to different folders to parallelize. Change newObsTrnLst to glbObsNewLst

In [57]:
thsBgnTm = datetime.now()
newObsTrnLst = [load('new', newFolders[0], 80000, 
                           maxCheck = True, verbose = True)]
# smpObsNewDct = load('new', newFolders[0], 25, maxCheck = False, verbose = False)
print 'len(newObsTrnLst): %d' % (len(newObsTrnLst))    
thsDurDff = (datetime.now() - thsBgnTm).seconds  
print 'newObs load duration: %0.2f seconds' % (thsDurDff) 
unknown class: new; defaulting label to -1
Class: new; Folder: /Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/test
  image_index:     0 (    2 secs)
  image_index:     2 (    2 secs)
  image_index:     4 (    3 secs)
  image_index:     6 (    4 secs)
  image_index:     8 (    4 secs)
  image_index:    20 (    5 secs)
  image_index:    40 (    8 secs)
  image_index:    60 (   10 secs)
  image_index:    80 (   12 secs)
  image_index:   200 (   22 secs)
  image_index:   400 (   39 secs)
  image_index:   600 (   55 secs)
  image_index:   800 (   72 secs)
  image_index:  2000 (  170 secs)
  image_index:  4000 (  335 secs)
  image_index:  6000 (  500 secs)
  image_index:  8000 (  663 secs)
  image_index: 20000 ( 1642 secs)
  image_index: 40000 ( 3349 secs)
  image_index: 60000 ( 5194 secs)
('  Identifiers:', 79726)
('  Full dataset tensor:', (79726, 64, 64))
('  Mean:', -0.097465999)
('  Standard deviation:', 0.33075851)
('  Labels:', (79726,))
  Label Knts:
-1    79726
dtype: int64
len(newObsTrnLst): 1
newObs load duration: 6999.00 seconds
In [58]:
glbObsNewLst = newObsTrnLst
In [59]:
glbObsNewIdn, glbObsNewFtr, glbObsNewRsp = mybuildDatabase(glbObsNewLst)
print('Identifiers:', len(glbObsNewIdn))
print('New Full dataset tensor:', glbObsNewFtr.shape)
print('Mean:', np.mean(glbObsNewFtr))
print('Standard deviation:', np.std(glbObsNewFtr))
print('Labels:', glbObsNewRsp.shape)
print('Label Knts:'); print(pd.Series(glbObsNewRsp).value_counts())
('Identifiers:', 79726)
('New Full dataset tensor:', (79726, 64, 64))
('Mean:', -0.097465999)
('Standard deviation:', 0.33075851)
('Labels:', (79726,))
Label Knts:
-1    79726
dtype: int64

Display sample Trn images from glbObsTrnFtr

In [107]:
print glbObsTrnIdn[100:105]
glbObsNewIdn, glbObsNewFtr, glbObsNewRsp = load(newFolders, 79726)
['img_12203.jpg', 'img_12237.jpg', 'img_12238.jpg', 'img_12247.jpg', 'img_12279.jpg']
/Users/bbalaji-2012/Documents/Work/DataScience/Kaggle/StateFarm/data/imgs/test
  image_index: 0; img_1.jpg:
  image_index: 20; img_100017.jpg:
  image_index: 40; img_100040.jpg:
  image_index: 60; img_100066.jpg:
  image_index: 80; img_100086.jpg:
  image_index: 200; img_10022.jpg:
  image_index: 400; img_100459.jpg:
  image_index: 600; img_100684.jpg:
  image_index: 800; img_100910.jpg:
  image_index: 2000; img_10358.jpg:
  image_index: 4000; img_12655.jpg:
  image_index: 6000; img_14950.jpg:
  image_index: 8000; img_17241.jpg:
  image_index: 20000; img_31120.jpg:
  image_index: 40000; img_5424.jpg:
  image_index: 60000; img_77326.jpg:
('Identifiers:', 79726)
('Full dataset tensor:', (79726, 32, 32))
('Mean:', -0.13392374)
('Standard deviation:', 0.2979852)
('Labels:', (79726,))
In [108]:
print glbObsNewIdn[1000:1005]
savObsNewRsp = glbObsNewRsp
glbObsNewRsp[:] = -1
print glbObsNewRsp[1000:1005]
['img_101147.jpg', 'img_101148.jpg', 'img_101149.jpg', 'img_10115.jpg', 'img_101150.jpg']
[-1 -1 -1 -1 -1]
In [49]:
# def load(data_folders, min_num_images, nImgMax):
#   dataset = np.ndarray(
#     shape=(nImgMax, glbImg['size'], glbImg['size']), dtype=np.float32)
#   labels = np.ndarray(shape=(nImgMax), dtype=np.int32)
#   label_index = 0
#   image_index = 0
#   for folder in data_folders:
#     print(folder)
#     for image in os.listdir(folder):
#       if image_index >= nImgMax:
#         raise Exception('More images than expected: %d >= %d' % (
#           image_index, nImgMax))
#       image_file = os.path.join(folder, image)
#       try:
#         image_data = (ndimage.imread(image_file).astype(float) -
#                       glbImgPixelDepth / 2) / glbImgPixelDepth
#         if image_data.shape != (glbImg['size'], glbImg['size']):
#           raise Exception('Unexpected image shape: %s' % str(image_data.shape))
#         dataset[image_index, :, :] = image_data
#         labels[image_index] = label_index
#         image_index += 1
#       except IOError as e:
#         print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')
#     label_index += 1
#   num_images = image_index
#   dataset = dataset[0:num_images, :, :]
#   labels = labels[0:num_images]
#   if num_images < min_num_images:
#     raise Exception('Many fewer images than expected: %d < %d' % (
#         num_images, min_num_images))
#   print('Full dataset tensor:', dataset.shape)
#   print('Mean:', np.mean(dataset))
#   print('Standard deviation:', np.std(dataset))
#   print('Labels:', labels.shape)
#   return dataset, labels

# glbObsTrnFtr, glbObsTrnRsp = load(train_folders, 450000, 550000)
# glbObsNewFtr, glbObsNewRsp = load(test_folders, 18000, 20000)

We expect the data to be balanced across classes. Verify that.

In [109]:
print 'glbObsTrnRsp class knts: '
print (np.unique(glbObsTrnRsp, return_counts = True))
print 'glbObsNewRsp class knts: '
print (np.unique(glbObsNewRsp, return_counts = True))
glbObsTrnRsp class knts: 
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32), array([2489, 2267, 2317, 2346, 2326, 2312, 2325, 2002, 1911, 2129]))
glbObsNewRsp class knts: 
(array([-1], dtype=int32), array([79726]))
In [110]:
#print type(glbObsTrnRsp); print glbObsTrnRsp.shape; print glbObsTrnRsp[0:10]
# print np.sum(glbObsTrnRsp == 0)
# print np.unique(glbObsTrnRsp)
# print 'train labels freqs: %s' % \
#     ([np.sum(glbObsTrnRsp == thsLabel) for thsLabel in np.unique(glbObsTrnRsp)])

Scrub data

In [ ]:
Refer to glbDataScrub

Export database

Save imported data.

In [60]:
glbPickleFile
Out[60]:
{'data': 'data/img_D_SFDD_ImgSz_64.pickle',
 'models': 'data/img_M_SFDD_ImgSz_64.pickle'}
In [63]:
try:
  f = open(glbPickleFile['data'], 'wb')
  save = {
    'glbObsTrnIdn': glbObsTrnIdn,
    'glbObsTrnFtr': glbObsTrnFtr,
    'glbObsTrnRsp': glbObsTrnRsp,
#     'glbObsVldFtr': glbObsVldFtr,
#     'glbObsVldRsp': glbObsVldRsp,
    'glbObsNewIdn': glbObsNewIdn,
    'glbObsNewFtr': glbObsNewFtr,
    'glbObsNewRsp': glbObsNewRsp,
    }
  pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
  f.close()
except Exception as e:
  print('Unable to save data to', glbPickleFile['data'], ':', e)
  raise
    
statinfo = os.stat(glbPickleFile['data'])
print('Compressed Data pickle size:', statinfo.st_size)    
('Compressed Data pickle size:', 1676068142)
In [133]:
with open('data/' + glbPickleFile, 'rb') as f:
  save = pickle.load(f)
#   train_dataset = save['train_dataset']
#   train_labels = save['train_labels']
#   valid_dataset = save['valid_dataset']
#   valid_labels = save['valid_labels']
  glbObsNewIdn = save['glbObsNewIdn']
  glbObsNewFtr = save['glbObsNewFtr']
  glbObsNewRsp = save['glbObsNewRsp']
#   test_dataset = save['test_dataset']
#   test_labels = save['test_labels']
  del save  # hint to help gc free up memory
#   print('Training set', train_dataset.shape, train_labels.shape)
#   print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('New set:', len(glbObsNewIdn), glbObsNewFtr.shape, glbObsNewRsp.shape)
('New set:', 79726, (79726, 32, 32), (79726,))

Inspect Resized Image Data

Let's verify that the data still looks good. Displaying a sample of the labels and images from the ndarray.

In [114]:
def mydisplayImages(obsIdn, obsFtr, obsRsp):
    imgIxLst = np.random.random_integers(0, obsFtr.shape[0] - 1, 10)
    for imgIx in imgIxLst:
        if (obsRsp[imgIx] > -1):
            print '  imgIx: %d; id: %s; label: %s' % \
                (imgIx, obsIdn[imgIx], glbRspClass[obsRsp[imgIx]])
        else:    
            print '  imgIx: %d; id: %s; label: None' % (imgIx, obsIdn[imgIx])    
        plt.figure
        plt.imshow(obsFtr[imgIx,:,:], cmap = plt.cm.gray)
        plt.show()
In [116]:
print 'Trn set:'; mydisplayImages(glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp)
train set:
  imgIx: 1998; id: img_81350.jpg; label: c0
  imgIx: 16380; id: img_9998.jpg; label: c6
  imgIx: 12179; id: img_25277.jpg; label: c5
  imgIx: 20739; id: img_28025.jpg; label: c9
  imgIx: 16429; id: img_101869.jpg; label: c7
  imgIx: 13843; id: img_91231.jpg; label: c5
  imgIx: 11535; id: img_92193.jpg; label: c4
  imgIx: 18030; id: img_8399.jpg; label: c7
  imgIx: 10849; id: img_64316.jpg; label: c4
  imgIx: 16244; id: img_94378.jpg; label: c6
In [59]:
# dspLabels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

# print 'train set:'
# imgIxLst = np.random.random_integers(0, glbObsTrnFtr.shape[0] - 1, 10)
# for imgIx in imgIxLst:
#     print 'imgIx: %d: label: %s' % (imgIx, dspLabels[glbObsTrnRsp[imgIx]])
#     plt.figure
#     plt.imshow(glbObsTrnFtr[imgIx,:,:], cmap = plt.cm.gray)
#     plt.show()
In [117]:
print 'New set:'; mydisplayImages(glbObsNewIdn, glbObsNewFtr, glbObsNewRsp)
New set:
  imgIx: 6018; id: img_14973.jpg; label: None
  imgIx: 54909; id: img_71431.jpg; label: None
  imgIx: 17412; id: img_28095.jpg; label: None
  imgIx: 66441; id: img_84692.jpg; label: None
  imgIx: 29698; id: img_42352.jpg; label: None
  imgIx: 21633; id: img_33020.jpg; label: None
  imgIx: 67069; id: img_85415.jpg; label: None
  imgIx: 46206; id: img_61395.jpg; label: None
  imgIx: 55126; id: img_71687.jpg; label: None
  imgIx: 53542; id: img_69827.jpg; label: None
imgIx: 61376: label: None
imgIx: 57757: label: None
imgIx: 47410: label: None
imgIx: 50825: label: None
imgIx: 36676: label: None
imgIx: 30483: label: None
imgIx: 37873: label: None
imgIx: 40371: label: None
imgIx: 28404: label: None
imgIx: 36545: label: None

Shuffle data

Next, we'll randomize the data. It's important to have the labels well shuffled for the training and test distributions to match.

In [129]:
# print type(glbObsTrnIdn)
# smpObsTrnIdn = glbObsTrnIdn[0:4]
# print smpObsTrnIdn
# print [smpObsTrnIdn[ix] for ix in [3, 1, 2, 0]]
# smpObsTrnIdn = [smpObsTrnIdn[ix] for ix in [3, 1, 2, 0]]
# print smpObsTrnIdn
<type 'list'>
['img_100026.jpg', 'img_10003.jpg', 'img_100050.jpg', 'img_100074.jpg']
['img_100074.jpg', 'img_10003.jpg', 'img_100050.jpg', 'img_100026.jpg']
['img_100074.jpg', 'img_10003.jpg', 'img_100050.jpg', 'img_100026.jpg']
In [130]:
np.random.seed(glbObsShuffleSeed)
def randomize(ids, dataset, labels):
  permutation = np.random.permutation(labels.shape[0])
  shuffled_ids = [ids[ix] for ix in permutation]
  shuffled_dataset = dataset[permutation,:,:]
  shuffled_labels = labels[permutation]
  return shuffled_ids, shuffled_dataset, shuffled_labels

glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp = randomize(glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp)
#glbObsNewIdn, glbObsNewFtr, glbObsNewRsp = randomize(glbObsNewIdn, glbObsNewFtr, glbObsNewRsp)
In [60]:
# np.random.seed(133)
# def randomize(dataset, labels):
#   permutation = np.random.permutation(labels.shape[0])
#   shuffled_dataset = dataset[permutation,:,:]
#   shuffled_labels = labels[permutation]
#   return shuffled_dataset, shuffled_labels
# glbObsTrnFtr, glbObsTrnRsp = randomize(glbObsTrnFtr, glbObsTrnRsp)
# glbObsNewFtr, glbObsNewRsp = randomize(glbObsNewFtr, glbObsNewRsp)

Check if data is still good after shuffling!

In [132]:
print 'shuffled Trn set:'; mydisplayImages(glbObsTrnIdn, glbObsTrnFtr, glbObsTrnRsp)
#print 'shuffled New set:'; mydisplayImages(glbObsNewIdn, glbObsNewFtr, glbObsNewRsp)
shuffled Trn set:
  imgIx: 16454; id: img_64535.jpg; label: c2
  imgIx: 7390; id: img_17025.jpg; label: c2
  imgIx: 8078; id: img_86976.jpg; label: c0
  imgIx: 21232; id: img_6938.jpg; label: c2
  imgIx: 15937; id: img_51095.jpg; label: c9
  imgIx: 6158; id: img_21315.jpg; label: c7
  imgIx: 940; id: img_12231.jpg; label: c8
  imgIx: 5581; id: img_40944.jpg; label: c0
  imgIx: 247; id: img_46103.jpg; label: c2
  imgIx: 15779; id: img_65559.jpg; label: c0
shuffled New set:
  imgIx: 41597; id: img_83982.jpg; label: None
  imgIx: 47985; id: img_67052.jpg; label: None
  imgIx: 64193; id: img_62477.jpg; label: None
  imgIx: 68657; id: img_82737.jpg; label: None
  imgIx: 61216; id: img_88962.jpg; label: None
  imgIx: 68700; id: img_94374.jpg; label: None
  imgIx: 77282; id: img_25609.jpg; label: None
  imgIx: 43848; id: img_38003.jpg; label: None
  imgIx: 3589; id: img_47193.jpg; label: None
  imgIx: 3119; id: img_52674.jpg; label: None

Prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune obsTrnN as needed.

Also create a validation dataset for hyperparameter tuning.

In [137]:
obsTrnN = glbObsTrnFtr.shape[0] # or fixed number e.g. 20000
obsVldN = int(obsTrnN * 0.2)
print 'obsTrnN: %d; obsVldN: %d' % (obsTrnN, obsVldN)

glbObsVldIdn = glbObsTrnIdn[:obsVldN]
glbObsVldFtr = glbObsTrnFtr[:obsVldN,:,:]
glbObsVldRsp = glbObsTrnRsp[:obsVldN]

glbObsFitIdn = glbObsTrnIdn[obsVldN:obsVldN+obsTrnN]
glbObsFitFtr = glbObsTrnFtr[obsVldN:obsVldN+obsTrnN,:,:]
glbObsFitRsp = glbObsTrnRsp[obsVldN:obsVldN+obsTrnN]

print('   Fitting:', len(glbObsFitIdn), glbObsFitFtr.shape, glbObsFitRsp.shape)
print('Validation:', len(glbObsVldIdn), glbObsVldFtr.shape, glbObsVldRsp.shape)
obsTrnN: 22424; obsVldN: 4484
('   Fitting:', 17940, (17940, 32, 32), (17940,))
('Validation:', 4484, (4484, 32, 32), (4484,))
In [71]:
# obsTrnN = glbObsTrnFtr.shape[0]
# #obsTrnN = 200000
# obsVldN = 10000

# glbObsVldFtr = glbObsTrnFtr[:obsVldN,:,:]
# glbObsVldRsp = glbObsTrnRsp[:obsVldN]
# glbObsTrnFtr = glbObsTrnFtr[obsVldN:obsVldN+obsTrnN,:,:]
# glbObsTrnRsp = glbObsTrnRsp[obsVldN:obsVldN+obsTrnN]
# print('Training', glbObsTrnFtr.shape, glbObsTrnRsp.shape)
# print('Validation', glbObsVldFtr.shape, glbObsVldRsp.shape)
In [146]:
print 'glbObsVldRsp class knts & Trn ratios: '
print (np.unique(glbObsVldRsp, return_counts = True))
print (np.unique(glbObsVldRsp, return_counts = True)[1] * 1.0 / 
       np.unique(glbObsTrnRsp, return_counts = True)[1])
glbObsVldRsp class knts & Trn ratios: 
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32), array([481, 468, 450, 453, 471, 466, 458, 400, 395, 442]))
[ 0.1932503   0.20644023  0.19421666  0.19309463  0.20249355  0.20155709
  0.19698925  0.1998002   0.20669806  0.20760921]

Finally, let's save the data for later reuse:
Remember to save previous pickled file as '_unshuffled'

In [75]:
# glbPickleFile = os.getcwd() + '/data/notMNIST.pickle'
# print glbPickleFile
In [138]:
try:
  f = open('data/' + glbPickleFile, 'wb')
  save = {
    'glbObsTrnIdn': glbObsTrnIdn,
    'glbObsTrnFtr': glbObsTrnFtr,
    'glbObsTrnRsp': glbObsTrnRsp,
        
    'glbObsFitIdn': glbObsFitIdn,        
    'glbObsFitFtr': glbObsFitFtr,
    'glbObsFitRsp': glbObsFitRsp,
        
    'glbObsVldIdn': glbObsVldIdn,        
    'glbObsVldFtr': glbObsVldFtr,
    'glbObsVldRsp': glbObsVldRsp,
        
    'glbObsNewIdn': glbObsNewIdn,        
    'glbObsNewFtr': glbObsNewFtr,
    'glbObsNewRsp': glbObsNewRsp,
    }
  pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
  f.close()
except Exception as e:
  print('Unable to save data to', glbPickleFile, ':', e)
  raise
    
statinfo = os.stat('data/' + glbPickleFile)
print('Compressed pickle size:', statinfo.st_size)       
('Compressed pickle size:', 512899134)
In [76]:
# #glbPickleFile = 'notMNIST.pickle'

# try:
#   f = open(glbPickleFile, 'wb')
#   save = {
#     'glbObsTrnFtr': glbObsTrnFtr,
#     'glbObsTrnRsp': glbObsTrnRsp,
#     'glbObsVldFtr': glbObsVldFtr,
#     'glbObsVldRsp': glbObsVldRsp,
#     'glbObsNewFtr': glbObsNewFtr,
#     'glbObsNewRsp': glbObsNewRsp,
#     }
#   pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
#   f.close()
# except Exception as e:
#   print('Unable to save data to', glbPickleFile, ':', e)
#   raise

Inspect overlap

By construction, this dataset might contain a lot of overlapping samples, including training data that's also contained in the validation and test set! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but are actually ok if you expect to see training samples recur when you use it. Measure how much overlap there is between training, validation and test samples.

Optional questions:

  • What about near duplicates between datasets? (images that are almost identical)
  • Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.

In [94]:
# print glbObsTrnFtr[0:3]
# print np.ascontiguousarray(glbObsTrnFtr[0:3])
# print np.ascontiguousarray(glbObsTrnFtr[0:3]).shape
In [139]:
obsFitSet = set(img.tostring() for img in glbObsFitFtr)
print 'Fit: shape: %s vs. len(set): %d pctDups: %0.4f' % \
    (glbObsFitFtr.shape, len(obsFitSet), \
     (glbObsFitFtr.shape[0] * 1.0 / len(obsFitSet) - 1) * 100)

obsVldSet = set(img.tostring() for img in glbObsVldFtr)
print 'Vld: shape: %s vs. len(set): %d pctDups: %0.4f' % \
    (glbObsVldFtr.shape, len(obsVldSet), \
     (glbObsVldFtr.shape[0] * 1.0 / len(obsVldSet) - 1) * 100)

obsNewSet = set(img.tostring() for img in glbObsNewFtr)
print 'New: shape: %s vs. len(set): %d pctDups: %0.4f' % \
    (glbObsNewFtr.shape, len(obsNewSet), \
     (glbObsNewFtr.shape[0] * 1.0 / len(obsNewSet) - 1) * 100) 
Fit: shape: (17940, 32, 32) vs. len(set): 17940 pctDups: 0.0000
Vld: shape: (4484, 32, 32) vs. len(set): 4484 pctDups: 0.0000
New: shape: (79726, 32, 32) vs. len(set): 79724 pctDups: 0.0025
In [79]:
#print glbObsTrnFtr[0:3]
# obsFitSet = set(img.tostring() for img in glbObsTrnFtr)
# print 'train: shape: %s vs. len(set): %d pctDups: %0.4f' % \
#     (glbObsTrnFtr.shape, len(obsFitSet), \
#      (glbObsTrnFtr.shape[0] * 1.0 / len(obsFitSet) - 1) * 100)

# validSet = set(img.tostring() for img in glbObsVldFtr)
# print 'valid: shape: %s vs. len(set): %d pctDups: %0.4f' % \
#     (glbObsVldFtr.shape, len(validSet), \
#      (glbObsVldFtr.shape[0] * 1.0 / len(validSet) - 1) * 100)

# obsNewSet = set(img.tostring() for img in glbObsNewFtr)
# print 'test : shape: %s vs. len(set): %d pctDups: %0.4f' % \
#     (glbObsNewFtr.shape, len(obsNewSet), \
#      (glbObsNewFtr.shape[0] * 1.0 / len(obsNewSet) - 1) * 100)    
In [142]:
print 'Vld set overlap with Fit set: %0.4f' % \
    (len(obsVldSet.intersection(obsFitSet)) * 1.0 / len(obsVldSet))
print 'Vld set overlap with New set: %0.4f' % \
    (len(obsVldSet.intersection(obsNewSet)) * 1.0 / len(obsNewSet))
print 'Fit set overlap with New set: %0.4f' % \
    (len(obsFitSet.intersection(obsNewSet)) * 1.0 / len(obsFitSet))
# print ' test set overlap with train set: %0.4f' % \
#     (len( obsNewSet.intersection(obsFitSet)) * 1.0 / len( obsNewSet))    
# print 'valid set overlap with  test set: %0.4f' % \
#     (len(validSet.intersection( obsNewSet)) * 1.0 / len(validSet))
Vld set overlap with Fit set: 0.0000
Vld set overlap with New set: 0.0000
Fit set overlap with New set: 0.0000

Stop here!

Following code is in img_02_fit_lgtRgr_SFDD

Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data!


In [110]:
# import graphlab
# print graphlab.version
# graphlab.canvas.set_target('ipynb')
1.8.1
In [ ]:
# graphlab.logistic_classifier.create(image_train,target='label',
#                                               features=['image_array'])
In [113]:
print glbObsTrnFtr[0:3,:,:]
print np.reshape(glbObsTrnFtr[0:3,:,:], (3, glbObsTrnFtr.shape[1] * glbObsTrnFtr.shape[2]))
print np.reshape(glbObsTrnFtr[0:3,:,:], (3, glbObsTrnFtr.shape[1] * glbObsTrnFtr.shape[2])).shape
[[[-0.5        -0.5        -0.5        ..., -0.49215686 -0.49607843 -0.5       ]
  [-0.5        -0.5        -0.5        ..., -0.44901961 -0.5        -0.49607843]
  [-0.5        -0.5        -0.5        ...,  0.29215688 -0.41764706 -0.5       ]
  ..., 
  [-0.5        -0.5        -0.5        ..., -0.49607843 -0.49607843
   -0.49607843]
  [-0.19019608  0.11176471  0.37450981 ..., -0.48823529 -0.49607843 -0.5       ]
  [ 0.24901961  0.34705883  0.19411765 ..., -0.49607843 -0.5        -0.5       ]]

 [[-0.5        -0.5        -0.5        ...,  0.5         0.5         0.5       ]
  [-0.5        -0.5        -0.5        ...,  0.5         0.5         0.5       ]
  [-0.5        -0.5        -0.5        ...,  0.5         0.5         0.5       ]
  ..., 
  [-0.43725491  0.04901961  0.38627452 ...,  0.45294118  0.22941177
   -0.30000001]
  [-0.5        -0.5        -0.3392157  ..., -0.20196079 -0.45686275 -0.5       ]
  [-0.49607843 -0.49215686 -0.49607843 ..., -0.5        -0.5        -0.49215686]]

 [[-0.5        -0.49607843 -0.43725491 ..., -0.5        -0.49607843 -0.5       ]
  [-0.40980393  0.11960784  0.42941177 ..., -0.24901961 -0.5        -0.49607843]
  [-0.03333334  0.5         0.48431373 ...,  0.41764706 -0.4254902  -0.5       ]
  ..., 
  [-0.5        -0.48039216 -0.06078431 ..., -0.5        -0.5        -0.5       ]
  [-0.5        -0.36666667  0.5        ..., -0.5        -0.5        -0.5       ]
  [-0.5        -0.39803922  0.28823531 ..., -0.5        -0.5        -0.5       ]]]
[[-0.5        -0.5        -0.5        ..., -0.49607843 -0.5        -0.5       ]
 [-0.5        -0.5        -0.5        ..., -0.5        -0.5        -0.49215686]
 [-0.5        -0.49607843 -0.43725491 ..., -0.5        -0.5        -0.5       ]]
(3, 784)
In [134]:
from sklearn import metrics, linear_model
import pandas as pd
In [171]:
def fitMdl(nFitObs = 50):
    mdl = linear_model.LogisticRegression(verbose = 1)
    mdl.fit(np.reshape(glbObsTrnFtr[0:nFitObs,:,:], \
                            (nFitObs, glbObsTrnFtr.shape[1] * glbObsTrnFtr.shape[2])), \
                 glbObsTrnRsp[0:nFitObs])
    print mdl.get_params()
    print mdl.coef_.shape
    print '  coeff stats:'
    for lblIx in xrange(len(dspLabels)):
        print '  label:%s; minCoeff:row:%2d, col:%2d, value:%0.4f; maxCoeff:row:%2d, col:%2d, value:%0.4f;' % \
            (dspLabels[lblIx], \
             mdl.coef_[lblIx,:].argmin() / glbImg['size'], \
             mdl.coef_[lblIx,:].argmin() % glbImg['size'], \
             mdl.coef_[lblIx,:].min(), \
             mdl.coef_[lblIx,:].argmax() / glbImg['size'], \
             mdl.coef_[lblIx,:].argmax() % glbImg['size'], \
             mdl.coef_[lblIx,:].max())

    train_pred_labels = mdl.predict(np.reshape(glbObsTrnFtr[0:nFitObs,:,:], \
                                                    (nFitObs               , glbImg['size'] ** 2)))
    accuracy_train = metrics.accuracy_score(train_pred_labels, glbObsTrnRsp[0:nFitObs])
    print '  accuracy train:%0.4f' % (accuracy_train)
    print metrics.confusion_matrix(glbObsTrnRsp[0:nFitObs], train_pred_labels)

    valid_pred_labels = mdl.predict(np.reshape(glbObsVldFtr, \
                                                    (glbObsVldFtr.shape[0], glbImg['size'] ** 2)))
    accuracy_valid = metrics.accuracy_score(valid_pred_labels, glbObsVldRsp)
    print '  accuracy valid:%0.4f' % (accuracy_valid)
    print metrics.confusion_matrix(glbObsVldRsp           , valid_pred_labels)

    test_pred_labels  = mdl.predict(np.reshape(glbObsNewFtr, \
                                                    (glbObsNewFtr.shape[0], glbImg['size'] ** 2)))
    accuracy_test = metrics.accuracy_score( test_pred_labels,  glbObsNewRsp)
    print '  accuracy  test:%0.4f' % (accuracy_test)
    test_conf = pd.DataFrame(metrics.confusion_matrix( glbObsNewRsp,  test_pred_labels), \
                             index = dspLabels, columns = dspLabels)
    print test_conf
    
    return(mdl, (accuracy_train, accuracy_valid, accuracy_test))
In [172]:
mdl50 = fitMdl(nFitObs = 50) 
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
(10, 784)
  coeff stats:
  label:A; minCoeff:row:26, col: 8, value:-0.2571; maxCoeff:row:24, col:25, value:0.1487;
  label:B; minCoeff:row: 2, col:20, value:-0.2250; maxCoeff:row:16, col:23, value:0.2356;
  label:C; minCoeff:row:26, col: 4, value:-0.2084; maxCoeff:row:25, col:26, value:0.2056;
  label:D; minCoeff:row:25, col: 7, value:-0.1682; maxCoeff:row: 9, col:25, value:0.1925;
  label:E; minCoeff:row: 1, col:19, value:-0.1914; maxCoeff:row:25, col:27, value:0.2057;
  label:F; minCoeff:row: 1, col:19, value:-0.1759; maxCoeff:row: 2, col: 1, value:0.2158;
  label:G; minCoeff:row: 1, col:19, value:-0.2289; maxCoeff:row:11, col: 0, value:0.1832;
  label:H; minCoeff:row:26, col: 9, value:-0.2210; maxCoeff:row:27, col:27, value:0.1907;
  label:I; minCoeff:row: 0, col:14, value:-0.1343; maxCoeff:row:27, col:27, value:0.2123;
  label:J; minCoeff:row:13, col: 9, value:-0.1960; maxCoeff:row: 0, col:21, value:0.1679;
  accuracy train:1.0000
[[5 0 0 0 0 0 0 0 0 0]
 [0 6 0 0 0 0 0 0 0 0]
 [0 0 4 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0]
 [0 0 0 0 6 0 0 0 0 0]
 [0 0 0 0 0 4 0 0 0 0]
 [0 0 0 0 0 0 6 0 0 0]
 [0 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 4 0]
 [0 0 0 0 0 0 0 0 0 9]]
  accuracy valid:0.5822
[[682  27   6  28  28  18  33  59  31 121]
 [ 24 671  18  48  33  21  73  14  63  49]
 [ 37  39 574  35 161   2 102   1  28  29]
 [ 24  53  16 698  14  21  60   5  31  63]
 [ 51 215 118  13 377   9  36   8  84  45]
 [ 55 173  17  18 168 437  21   4  16  53]
 [ 43  46 216  37  60  20 513   3  20  80]
 [ 79 101   8  30  90 160  37 385  50  35]
 [ 47  20  38  10  60   7  32   7 625 188]
 [ 26  11   9  16  17  16  26   1  13 860]]
  accuracy  test:0.6381
      A     B     C     D    E    F     G    H     I     J
A  1283    43     9    34   33   40    42  103    34   251
B    38  1448    19    65   41   23    73   16    77    73
C    27    65  1241    53  271    5   137    1    44    29
D    24    92    32  1474   25   30    68    5    43    80
E    46   495   210    14  818   15    43   13   175    44
F    93   371    16    22  302  919    25    9    29    86
G    67    61   417    60   76   16  1015    6    37   117
H   144   182    18    39  206  241    44  793   154    51
I    33    21    36    20  122   16    26   10  1223   365
J    19     9    13    22   21   15    16    1    22  1734
In [181]:
models = pd.DataFrame({'nFitObs': [1e2, 1e3, 1e4, 1e5, glbObsTrnFtr.shape[0]]})
models = models.set_index(models['nFitObs'])
models['mdl'] = linear_model.LogisticRegression()
models['accuracy.fit'] = -1; models['accuracy.vld'] = -1; models['accuracy.new'] = -1

for thsN in models['nFitObs']: 
    models.ix[thsN, 'mdl'], (models.ix[thsN, 'accuracy.fit'], \
                             models.ix[thsN, 'accuracy.vld'], \
                             models.ix[thsN, 'accuracy.new'], \
                            ) = fitMdl(nFitObs = thsN)
    
print models
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:3: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  app.launch_new_instance()
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:10: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
(10, 784)
  coeff stats:
  label:A; minCoeff:row:26, col: 8, value:-0.3014; maxCoeff:row:17, col:17, value:0.2229;
  label:B; minCoeff:row: 1, col:19, value:-0.2240; maxCoeff:row:16, col:23, value:0.3035;
  label:C; minCoeff:row:26, col: 8, value:-0.2396; maxCoeff:row:25, col:15, value:0.1714;
  label:D; minCoeff:row:14, col:19, value:-0.2523; maxCoeff:row:27, col: 1, value:0.2116;
  label:E; minCoeff:row: 9, col:19, value:-0.2736; maxCoeff:row:11, col:11, value:0.2807;
  label:F; minCoeff:row:26, col:19, value:-0.3569; maxCoeff:row: 2, col: 2, value:0.2562;
  label:G; minCoeff:row: 1, col:19, value:-0.2610; maxCoeff:row:18, col:27, value:0.2457;
  label:H; minCoeff:row:26, col:10, value:-0.2259; maxCoeff:row: 0, col:27, value:0.1947;
  label:I; minCoeff:row:15, col:18, value:-0.2584; maxCoeff:row:27, col:27, value:0.2571;
  label:J; minCoeff:row:24, col: 5, value:-0.2323; maxCoeff:row: 0, col:27, value:0.2230;
  accuracy train:1.0000
[[11  0  0  0  0  0  0  0  0  0]
 [ 0  9  0  0  0  0  0  0  0  0]
 [ 0  0  9  0  0  0  0  0  0  0]
 [ 0  0  0  6  0  0  0  0  0  0]
 [ 0  0  0  0 16  0  0  0  0  0]
 [ 0  0  0  0  0  9  0  0  0  0]
 [ 0  0  0  0  0  0 12  0  0  0]
 [ 0  0  0  0  0  0  0  6  0  0]
 [ 0  0  0  0  0  0  0  0 10  0]
 [ 0  0  0  0  0  0  0  0  0 12]]
  accuracy valid:0.6829
[[728  26  10  17  45  13  45  39  22  88]
 [ 23 692  14  42  67  24  49  20  51  32]
 [ 16  22 699  13 128   6  70   6  20  28]
 [ 16  54  20 694  32  23  38  22  27  59]
 [ 18  24  78   4 606  52  25  36  76  37]
 [ 13  13  17   2 161 671  16   9  18  42]
 [ 20  28  93  17  53  21 733   5  20  48]
 [ 57  44  10  23 118 115  34 516  30  28]
 [ 30  10  31   7  85  24  27  13 651 156]
 [ 18   7  13  16  27  23  27   4  21 839]]
  accuracy  test:0.7498
      A     B     C     D     E     F     G     H     I     J
A  1382    33     7    22    46    41    61    50    26   204
B    31  1485    18    44    84    28    41    14    65    63
C     4    38  1484    15   152     6   119     6    28    21
D    17    72    28  1502    35    38    47    14    36    84
E    27    92   153     7  1199   115    24    62   166    28
F    23     3    13     7   247  1433    15    36    26    69
G    26    41   152    11    71    21  1462     8    21    59
H    97    48    20    33   238   175    42  1092    81    46
I    29    22    25    15   105    36    18    16  1310   296
J     7    12    18    26    23    36    16     8    36  1690
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
(10, 784)
  coeff stats:
  label:A; minCoeff:row: 4, col: 7, value:-0.7170; maxCoeff:row:27, col:27, value:0.6456;
  label:B; minCoeff:row: 0, col:26, value:-1.0239; maxCoeff:row:18, col:27, value:0.6353;
  label:C; minCoeff:row:15, col:16, value:-0.6154; maxCoeff:row: 7, col:26, value:0.6948;
  label:D; minCoeff:row: 1, col:27, value:-0.6937; maxCoeff:row:13, col:26, value:0.6355;
  label:E; minCoeff:row:17, col:26, value:-1.0853; maxCoeff:row:14, col:21, value:0.7244;
  label:F; minCoeff:row:11, col:13, value:-0.4294; maxCoeff:row: 2, col: 1, value:0.4680;
  label:G; minCoeff:row:12, col:18, value:-0.7411; maxCoeff:row:15, col:14, value:0.6454;
  label:H; minCoeff:row: 0, col:15, value:-0.7681; maxCoeff:row: 0, col:27, value:1.0599;
  label:I; minCoeff:row:23, col:18, value:-0.7867; maxCoeff:row:24, col: 2, value:0.7645;
  label:J; minCoeff:row:27, col: 7, value:-0.7077; maxCoeff:row: 0, col:27, value:0.9120;
  accuracy train:0.9950
[[110   0   0   0   0   0   0   0   0   0]
 [  1 106   0   0   0   0   0   0   0   0]
 [  0   0  99   0   0   0   0   0   0   0]
 [  0   0   0  91   0   0   0   0   0   0]
 [  0   0   0   0 102   0   0   0   0   0]
 [  0   0   0   0   0  83   0   0   0   0]
 [  0   0   0   0   0   0 102   0   0   0]
 [  0   0   0   0   0   0   0 101   0   0]
 [  0   0   0   0   0   0   1   1  96   1]
 [  0   0   0   0   0   0   0   0   1 105]]
  accuracy valid:0.7580
[[797  15  15  25  19  15  20  50  32  45]
 [ 28 740  19  60  32  17  31  19  38  30]
 [ 28  17 779  12  48  10  51  10  25  28]
 [ 27  33  11 768  16  17  32  10  31  40]
 [ 24  35  62  11 642  23  35  35  57  32]
 [ 19  19  16  12  53 751  18  10  29  35]
 [ 26  29  54  22  37  19 774  12  22  43]
 [ 36  24  15  25  32  29  24 734  25  31]
 [ 23  14  14  20  34  20  30  21 769  89]
 [ 26   5  13  22  15  24  22  13  29 826]]
  accuracy  test:0.8342
      A     B     C     D     E     F     G     H     I     J
A  1543     8    13    22    20    24    41    94    33    74
B    27  1550    18    57    57    21    42    21    42    38
C    23    23  1627    12    63    10    58     7    22    28
D    21    35    25  1627    25    15    30    11    41    43
E    27   108   104     9  1401    26    28    56    84    30
F    23    19    15    21    53  1614    19    14    38    56
G    34    19   104    22    48    17  1513    23    28    64
H    57    36    19    26    49    23    31  1541    45    45
I    36    13    14    17    43    22    30    19  1520   158
J    24     7    12    23    11    33    25     7    46  1684
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
(10, 784)
  coeff stats:
  label:A; minCoeff:row:23, col:13, value:-1.6050; maxCoeff:row:27, col:27, value:1.5933;
  label:B; minCoeff:row:14, col:27, value:-1.6509; maxCoeff:row:10, col:18, value:1.4947;
  label:C; minCoeff:row: 9, col:16, value:-1.4944; maxCoeff:row: 6, col:10, value:1.3965;
  label:D; minCoeff:row: 0, col:27, value:-1.3080; maxCoeff:row:16, col:21, value:1.3399;
  label:E; minCoeff:row:19, col: 6, value:-1.3329; maxCoeff:row: 8, col: 8, value:1.4913;
  label:F; minCoeff:row:12, col:23, value:-1.5250; maxCoeff:row:19, col: 2, value:1.3375;
  label:G; minCoeff:row:13, col: 5, value:-1.4522; maxCoeff:row:17, col:15, value:1.6977;
  label:H; minCoeff:row: 0, col:15, value:-1.7416; maxCoeff:row:16, col:15, value:1.5178;
  label:I; minCoeff:row:26, col:17, value:-1.2247; maxCoeff:row:23, col:10, value:1.4428;
  label:J; minCoeff:row: 3, col: 4, value:-1.2762; maxCoeff:row:17, col:10, value:1.3836;
  accuracy train:0.8983
[[939   9   9  10   9   7  12  27  15  13]
 [  5 889   3  26  18   5  12  13  17   5]
 [  3   7 922   3  13   2  10   6  14   5]
 [  9   9   3 906   8   8  13  10  11   5]
 [ 10  17  30   8 833  12  19  11  35   5]
 [  8   1   3   9   5 902  13   7  14   6]
 [  9   5  27  18   9  11 887  17  23  13]
 [ 19   6   2   7   8   7  18 892  24   7]
 [ 14   8   7  20  10  20  12  20 887  40]
 [  5   5   3   5   2  13   1   9  26 926]]
  accuracy valid:0.7892
[[829   9  16  19  24  13  23  45  33  22]
 [ 20 758  18  42  31  25  35  28  38  19]
 [ 16  20 827  14  29  16  35  18  23  10]
 [ 21  28   8 791  11  23  21  20  27  35]
 [ 19  28  73   9 682  29  29  29  44  14]
 [ 16  13  25  13  23 786  19  19  34  14]
 [ 15  10  60  25  22  25 822  17  21  21]
 [ 40  11  16  20  20  32  23 771  28  14]
 [ 25  12  19  18  22  19  17  36 814  52]
 [ 15  11  14  24  10  29  18  19  43 812]]
  accuracy  test:0.8602
      A     B     C     D     E     F     G     H     I     J
A  1617    20    19    15    26    15    27    57    24    52
B    20  1583    13    56    54    28    35    24    32    28
C     7    15  1692     9    53    13    47     5    20    12
D    26    49    18  1637    16    17    21    13    34    42
E    27    62    95     9  1502    38    28    29    61    22
F    22    21    24    11    30  1660    17    12    46    29
G    28    23    78    27    35    28  1577    23    20    33
H    68    20    19    19    35    29    24  1593    33    32
I    25    11    17    26    39    31    28    35  1572    88
J    22     7    18    17     7    33    21     7    66  1674
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
(10, 784)
  coeff stats:
  label:A; minCoeff:row:27, col:10, value:-1.1414; maxCoeff:row: 6, col: 2, value:1.1745;
  label:B; minCoeff:row: 0, col:26, value:-1.7742; maxCoeff:row:20, col:27, value:1.0678;
  label:C; minCoeff:row: 1, col: 3, value:-0.9762; maxCoeff:row: 1, col: 4, value:1.0519;
  label:D; minCoeff:row: 0, col:26, value:-1.4436; maxCoeff:row:14, col: 9, value:0.8546;
  label:E; minCoeff:row:20, col:27, value:-0.9845; maxCoeff:row: 8, col: 0, value:1.8054;
  label:F; minCoeff:row:23, col:26, value:-1.1607; maxCoeff:row: 1, col:27, value:0.7309;
  label:G; minCoeff:row:13, col:12, value:-1.0996; maxCoeff:row:19, col:17, value:0.9612;
  label:H; minCoeff:row: 0, col:14, value:-1.1564; maxCoeff:row: 0, col:27, value:0.7920;
  label:I; minCoeff:row:15, col: 2, value:-0.7929; maxCoeff:row: 9, col: 2, value:0.9165;
  label:J; minCoeff:row:13, col: 0, value:-1.0014; maxCoeff:row: 0, col:26, value:1.4257;
  accuracy train:0.8344
[[8411  135  112  133  105  133  171  380  236  237]
 [ 147 8005  104  406  236  141  222  174  242  122]
 [  77  102 8744  117  203  107  252  116  199   84]
 [ 151  233   99 8443   86  158  150  161  220  142]
 [ 144  195  549  119 7692  299  283  177  444  139]
 [ 100   62  107   93  109 8753  156  125  260  199]
 [ 168  136  401  145  147  195 8363  125  267  209]
 [ 317  121   95  154  159  159  171 8556  304  148]
 [ 184  129  147  164  171  220  194  274 7897  597]
 [ 131   90   87  157   71  192  151  123  405 8575]]
  accuracy valid:0.8206
[[857  16  12  14  12  10  19  38  29  26]
 [ 17 787  12  53  33  26  25  20  26  15]
 [  7  10 856  12  30   9  35  16  23  10]
 [ 14  23   6 829   9  16  26  16  25  21]
 [ 11  25  62   9 723  24  28  25  36  13]
 [ 12   9  15  11  14 824  19  11  32  15]
 [ 11  12  46  18  15  30 848  11  25  22]
 [ 32  10   7  14  20  16  20 814  27  15]
 [ 19   9  21  11  13  19  23  38 831  50]
 [ 13   6   9  23  10  31  11  15  40 837]]
  accuracy  test:0.8891
      A     B     C     D     E     F     G     H     I     J
A  1659    12    14     8    14     8    29    65    23    40
B    11  1643    11    62    33    25    29    18    28    13
C     6     8  1742     4    27    22    27     7    23     7
D    16    28     9  1718    12    18     8    16    24    24
E    13    47    74    13  1568    35    32    14    65    12
F    16     7    16     7    10  1722    19     5    34    36
G    18    16    67    17    16    42  1636    17    27    16
H    50    20    11    12    28    27    21  1651    39    13
I    27    10    12    18    28    30    25    24  1603    95
J    16     2     7    18    10    36    15    10    52  1706
[LibLinear]{'warm_start': False, 'C': 1.0, 'n_jobs': 1, 'verbose': 1, 'intercept_scaling': 1, 'fit_intercept': True, 'max_iter': 100, 'penalty': 'l2', 'multi_class': 'ovr', 'random_state': None, 'dual': False, 'tol': 0.0001, 'solver': 'liblinear', 'class_weight': None}
(10, 784)
  coeff stats:
  label:A; minCoeff:row:27, col:12, value:-1.0494; maxCoeff:row:27, col:27, value:1.1473;
  label:B; minCoeff:row: 0, col:26, value:-1.8980; maxCoeff:row:20, col:27, value:0.8489;
  label:C; minCoeff:row:27, col: 1, value:-0.8797; maxCoeff:row:13, col: 0, value:0.7399;
  label:D; minCoeff:row: 0, col:27, value:-1.1843; maxCoeff:row:27, col: 5, value:0.7192;
  label:E; minCoeff:row: 7, col: 0, value:-0.8467; maxCoeff:row: 8, col: 0, value:1.0877;
  label:F; minCoeff:row:27, col:27, value:-0.8682; maxCoeff:row: 8, col:27, value:0.8271;
  label:G; minCoeff:row:27, col: 0, value:-0.8403; maxCoeff:row:15, col:27, value:0.7896;
  label:H; minCoeff:row: 0, col:13, value:-1.1541; maxCoeff:row: 0, col:27, value:0.8937;
  label:I; minCoeff:row:26, col: 3, value:-0.6741; maxCoeff:row:20, col: 0, value:0.8260;
  label:J; minCoeff:row:27, col: 1, value:-0.9333; maxCoeff:row: 0, col:26, value:1.4196;
  accuracy train:0.8295
[[43020   771   559   789   559   637   933  2049  1217  1342]
 [  746 41687   527  2334  1263   909  1196   958  1432   845]
 [  375   602 45251   469  1162   515  1458   509  1130   433]
 [  729  1281   474 44419   427   904   772   792  1153   975]
 [  630  1114  2918   652 39753  1624  1319   984  2316   646]
 [  565   385   601   486   520 45425   869   661  1422  1016]
 [  895   836  2280   768   740  1017 42364   672  1282  1020]
 [ 1577   749   436   820   769   875   838 43527  1546   800]
 [ 1046   727   713   961   934  1094  1018  1338 40968  3079]
 [  826   477   491   896   390  1032   746   606  2257 44195]]
  accuracy valid:0.8252
[[872  14  12  14  10   7  17  35  27  25]
 [ 20 797  12  54  32  25  24  14  26  10]
 [  8  14 861  14  26  12  31  11  20  11]
 [ 13  28   6 832  10  12  14  17  33  20]
 [ 15  20  60  13 729  24  24  22  37  12]
 [ 13   7  11  10  15 832  17  14  28  15]
 [ 12  15  41  18  10  32 857  13  20  20]
 [ 38   7   5  10  22  21  21 808  27  16]
 [ 21  13  19   9  19  21  23  29 830  50]
 [ 11   9  13  22   9  28  12  15  42 834]]
  accuracy  test:0.8938
      A     B     C     D     E     F     G     H     I     J
A  1661    17    10    14    16    10    19    55    26    44
B    11  1651    12    62    25    27    28    14    27    16
C     4     4  1750     6    25    20    28     7    22     7
D    15    24     9  1726    10    23     9    10    25    22
E    11    54    74    12  1575    34    23    13    60    17
F    13     7    12     4     6  1750    11     6    24    39
G    17    18    60    13    11    40  1653    15    27    18
H    49    18    11    16    29    23    20  1648    38    20
I    24     8    12    19    28    30    25    22  1605    99
J    14     5    11    16     6    36     9     5    53  1717
         nFitObs                                                mdl  \
nFitObs                                                               
100          100  LogisticRegression(C=1.0, class_weight=None, d...   
1000        1000  LogisticRegression(C=1.0, class_weight=None, d...   
10000      10000  LogisticRegression(C=1.0, class_weight=None, d...   
100000    100000  LogisticRegression(C=1.0, class_weight=None, d...   
519114    519114  LogisticRegression(C=1.0, class_weight=None, d...   

         accuracy.fit  accuracy.vld  accuracy.new  
nFitObs                                            
100          1.000000        0.6829      0.749786  
1000         0.995000        0.7580      0.834223  
10000        0.898300        0.7892      0.860233  
100000       0.834390        0.8206      0.889126  
519114       0.829508        0.8252      0.893826  
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:11: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:13: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
In [192]:
plt.figure()
plt.plot(models['nFitObs'], models['accuracy.fit'], 'bo-', label = 'fit')
plt.plot(models['nFitObs'], models['accuracy.vld'], 'rs-', label = 'vld')
plt.plot(models['nFitObs'], models['accuracy.new'], 'gp-', label = 'new')
plt.legend()
plt.title("Accuracy")
plt.xscale('log')
axes = plt.gca()
axes.set_xlabel('nFitObs')
# axes.set_xlim([mdlDF['l1_penalty'][mdlDF['RSS.vld'].argmin()] / 10 ** 2, \
#                mdlDF['l1_penalty'][mdlDF['RSS.vld'].argmin()] * 10 ** 2])
# axes.set_ylim([0, mdlDF['RSS.vld'].min() * 1.5])
plt.show()
In [ ]:
 
In [ ]:
 
In [123]:
print dspLabels
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
In [154]:
import pandas as pd
[INFO] This non-commercial license of GraphLab Create is assigned to bbalaji8@gmail.com and will expire on December 09, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-18168 - Server binary: /usr/local/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1454417383.log
[INFO] GraphLab Server Version: 1.8.1
In [ ]: